Mike Ciaccio

Univariate Plots Section

This report investigates the wineQualityReds.csv dataset, consisting of
13 variables for 1599 observations.

Quick look - column names and first observation

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4              0.7           0            1.9     0.076
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
##   quality
## 1       5


structure - wineQualityReds.csv

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
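
The quick look above can also be reproduced outside R. As a minimal sketch in Python (the report itself was generated in R), the snippet below parses a two-line stand-in for the file, built from the column names and first observation printed above, and counts the columns. The in-memory `sample` string and the variable names are assumptions for illustration only.

```python
import csv
import io

# Stand-in for wineQualityReds.csv: the header plus the first observation
# shown in the quick look above (assumed layout, not the real file).
sample = (
    "X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,"
    "free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality\n"
    "1,7.4,0.7,0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5\n"
)

rows = list(csv.DictReader(io.StringIO(sample)))
columns = list(rows[0].keys())

print(len(columns), "columns")          # 13 columns
print(columns[0], "...", columns[-1])   # X ... quality
print(float(rows[0]["alcohol"]), int(rows[0]["quality"]))  # 9.4 5
```

csv.DictReader returns every value as a string, so numeric columns need an explicit float() or int() conversion before any analysis.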


Initial Exploratory Data Analysis

My primary interest is the relationship between red wine quality and
alcohol content. Also of interest is the relationship between pH and
the other 3 acidity metrics: fixed.acidity, volatile.acidity, and citric.acid.

Drill down into the individual attributes.
Red Wine quality analysis.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.000   5.000   6.000   5.636   6.000   8.000 


The quality attribute approximates a normal distribution.

Red Wine alcohol analysis.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   8.40    9.50   10.20   10.42   11.10   14.90 


The alcohol distribution is right skewed; a log10 plot follows.

Red Wine log10 alcohol analysis.

The alcohol log10 plot approximates a normal distribution.
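
One rough way to put a number on the effect of the log10 transform, using only the summary statistics printed above, is Bowley's quartile skewness. This is a sketch in Python under stated assumptions, not the method used in this report: the quartiles are the rounded values from the alcohol summary, and the quartiles of log10(alcohol) are approximated by the logs of the quartiles.

```python
import math

def quartile_skewness(q1, median, q3):
    """Bowley's quartile skewness: 0 for symmetric quartiles, > 0 for a right skew."""
    return (q3 + q1 - 2 * median) / (q3 - q1)

# Alcohol quartiles taken from the summary above (rounded values).
q1, median, q3 = 9.50, 10.20, 11.10

raw = quartile_skewness(q1, median, q3)
# log10 is monotonic, so the quartiles of log10(alcohol) are approximately
# the logs of the original quartiles.
logged = quartile_skewness(math.log10(q1), math.log10(median), math.log10(q3))

print(f"raw:   {raw:.3f}")     # 0.125
print(f"log10: {logged:.3f}")  # about 0.086
```

The measure only looks at the middle half of the data, so it says nothing about the long upper tail; the histogram remains the better guide.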

Red Wine pH analysis.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.740   3.210   3.310   3.311   3.400   4.010 


The pH attribute approximates a normal distribution.

Red Wine fixed acidity analysis.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   4.60    7.10    7.90    8.32    9.20   15.90 


The fixed acidity attribute approximates a normal distribution.

Red Wine volatile acidity analysis.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1200  0.3900  0.5200  0.5278  0.6400  1.5800 


The volatile acidity attribute approximates a normal distribution.

Red Wine citric.acid analysis.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.090   0.260   0.271   0.420   1.000 



Red Wine residual sugar analysis.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.900   1.900   2.200   2.539   2.600  15.500 


The residual sugar distribution is right skewed; a log10 plot follows.

Red Wine log10 residual sugar analysis.
The residual sugar log10 plot approximates a normal distribution.

Red Wine density analysis.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.9901  0.9956  0.9968  0.9967  0.9978  1.0037 


The density attribute approximates a normal distribution.

Red Wine chlorides analysis.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.01200 0.07000 0.07900 0.08747 0.09000 0.61100 


The chlorides distribution is right skewed; a log10 plot follows.

Red Wine log10 chlorides analysis.

The chlorides log10 plot approximates a normal distribution.

Red Wine free sulfur dioxide analysis.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    7.00   14.00   15.87   21.00   72.00 


The free sulfur dioxide distribution is right skewed; a log10 plot follows.

Red Wine log10 free sulfur dioxide analysis.

Red Wine total sulfur dioxide analysis.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   6.00   22.00   38.00   46.47   62.00  289.00 


The total sulfur dioxide distribution is right skewed; a log10 plot follows.

Red Wine log10 total sulfur dioxide analysis.
The total sulfur dioxide log10 plot approximates a normal distribution.

Red Wine sulfates analysis.

The sulfates attribute approximates a normal distribution.

Univariate Analysis

What is the structure of your dataset?

The dataset sourced from wineQualityReds.csv has 1599 entries each with 13 features.

The 11 num features of interest are:
fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol

quality is a type int feature
The quality feature range is 3 through 8.
3 represents a lower quality wine while 8 indicates a higher quality wine.
quality was analyzed as a categorical variable.

wineQualityReds.csv was downloaded from
https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityReds.csv

What is/are the main feature(s) of interest in your dataset?

My interest is exploring a possible relationship between alcohol and quality.
For example, does alcohol content appear to influence the subjective quality rating of the wine?

I am also interested in the 4 acidity metrics pH, fixed.acidity, volatile.acidity, and citric.acid: how they relate to each other, and how they relate to the quality metric.

I will use Bivariate Plots, Bivariate Analysis, Multivariate Plots, and
Multivariate Analysis to further my analysis of the alcohol, quality and acidity metrics.

The following red wine characteristics showed an approximately normal distribution without transformation -

  • quality
  • pH
  • fixed.acidity
  • volatile.acidity
  • density
  • sulphates

The following red wine characteristics showed a right skewed distribution without transformation -

  • alcohol
  • residual.sugar
  • chlorides
  • free.sulfur.dioxide
  • total.sulfur.dioxide

Follow-up log10 plotting showed a near normal distribution for the following -

  • alcohol
  • residual.sugar
  • chlorides
  • total.sulfur.dioxide

The initial plot of citric.acid did not reveal a recognizable distribution.
The follow-up log10 plot of free.sulfur.dioxide did not reveal a recognizable distribution.

Bivariate Plots Section

Pearson correlation interpretation.

Focus on the relationship between quality and alcohol.



summary statistics - quality

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

summary statistics - alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90




Focus on the relationship between pH and acidity metrics.



pH summary statistics

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

volatile acidity summary statistics

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

fixed acidity summary statistics

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

citric acid summary statistics

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000




Red Wine alcohol - quality analysis, alcohol percent by volume.
Wines with higher quality ratings (6, 7, and 8) contain progressively more alcohol.

Red Wine alcohol - quality analysis, quality percent by volume.
Quantify the strength of the quality - alcohol relationship.
Pearson’s r


    Pearson's product-moment correlation

data:  quality and alcohol
t = 21.639, df = 1597, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4373540 0.5132081
sample estimates:
      cor 
0.4761663 

Correlation Interpretation
The quality - alcohol correlation (r = 0.476) is greater than the accepted meaningful correlation threshold of 0.3.
It is less than the accepted moderate correlation threshold of 0.5.
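
The t statistic and confidence interval in the output above can be recovered from the correlation estimate and the sample size alone. The sketch below is in Python rather than R; the function names are hypothetical, and it assumes the interval is the asymptotic one based on Fisher's z-transformation, which is how R's cor.test documents its Pearson interval.

```python
import math
from statistics import NormalDist

def pearson_t(r, n):
    """t statistic for H0: true correlation = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def fisher_ci(r, n, level=0.95):
    """Asymptotic confidence interval for r via the Fisher z-transformation."""
    z = math.atanh(r)
    half_width = NormalDist().inv_cdf(0.5 + level / 2) / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

# Reported estimate for quality vs. alcohol; n = df + 2 = 1599.
r, n = 0.4761663, 1599

t = pearson_t(r, n)
low, high = fisher_ci(r, n)
print(f"t = {t:.3f}")                       # close to the reported 21.639
print(f"95% CI = ({low:.4f}, {high:.4f})")  # close to (0.4374, 0.5132)
```

Because the whole interval sits between 0.3 and 0.52, the reading of "meaningful but short of clearly moderate" holds across most of the plausible range, not just at the point estimate.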

Leverage the above quality - alcohol visualizations to fine tune correlation analysis.
Quantify the strength of the quality - alcohol relationship - quality > 4.
Pearson’s r


    Pearson's product-moment correlation

data:  quality and alcohol
t = 23.962, df = 1534, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4845165 0.5573539
sample estimates:
      cor 
0.5218858 

Correlation Interpretation - subset quality > 4
The subset quality - alcohol correlation (r = 0.522) is greater than the accepted moderate correlation threshold of 0.5.
It is less than the accepted strong correlation threshold of 0.7.
Excluding wines rated 3 and 4 raises the quality - alcohol correlation from 0.476 to 0.522.
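
Squaring each estimate gives a sense of how much the subset changes the picture. The sketch below (Python, hypothetical function name) treats r squared as the share of variance in quality that a straight-line fit on alcohol would account for; that reading is an assumption, since it relies on a linear relationship and treats quality as numeric.

```python
def variance_explained(r):
    """Coefficient of determination for a simple linear relationship."""
    return r * r

# Correlation estimates reported in the two cor.test outputs above.
for label, r in [("all wines", 0.4761663), ("quality > 4", 0.5218858)]:
    print(f"{label}: r = {r:.3f}, r^2 = {variance_explained(r):.3f}")
# all wines: r = 0.476, r^2 = 0.227
# quality > 4: r = 0.522, r^2 = 0.272
```

On this reading, alcohol accounts for roughly a quarter of the variation in quality in either case.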


Alcohol and quality are positively correlated.

quality - acidity metrics analysis

sorted by Pearson’s r correlation ascending


Pearson’s r - quality - volatile acidity correlation


    Pearson's product-moment correlation

data:  wineQualityReds$quality and wineQualityReds$volatile.acidity
t = -16.954, df = 1597, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.4313210 -0.3482032
sample estimates:
       cor 
-0.3905578 

The quality - volatile acidity correlation is negative: meaningful but weak.

Pearson’s r - quality - pH correlation


    Pearson's product-moment correlation

data:  wineQualityReds$quality and wineQualityReds$pH
t = -2.3109, df = 1597, p-value = 0.02096
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.106451268 -0.008734972
sample estimates:
        cor 
-0.05773139 

The quality - pH correlation is negative, and its magnitude (0.058) is well below the accepted weak threshold of 0.3.

Pearson’s r - quality - fixed acidity correlation


    Pearson's product-moment correlation

data:  wineQualityReds$quality and wineQualityReds$fixed.acidity
t = 4.996, df = 1597, p-value = 6.496e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.07548957 0.17202667
sample estimates:
      cor 
0.1240516 

The quality - fixed acidity correlation is below the accepted weak threshold of 0.3.

Pearson’s r - quality - citric.acid correlation


    Pearson's product-moment correlation

data:  wineQualityReds$quality and wineQualityReds$citric.acid
t = 9.2875, df = 1597, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1793415 0.2723711
sample estimates:
      cor 
0.2263725 

The quality - citric acid correlation is below the accepted weak threshold of 0.3.

acidity metrics - pH analysis

sorted by Pearson’s r correlation ascending


Pearson’s r - pH - fixed acidity correlation


    Pearson's product-moment correlation

data:  wineQualityReds$pH and wineQualityReds$fixed.acidity
t = -37.366, df = 1597, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7082857 -0.6559174
sample estimates:
       cor 
-0.6829782 

pH - fixed acidity - moderate negative correlation

Pearson’s r - pH - citric acid correlation


    Pearson's product-moment correlation

data:  wineQualityReds$pH and wineQualityReds$citric.acid
t = -25.767, df = 1597, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.5756337 -0.5063336
sample estimates:
       cor 
-0.5419041 

pH - citric acid - moderate negative correlation

Pearson’s r - pH - volatile acidity correlation


    Pearson's product-moment correlation

data:  wineQualityReds$pH and wineQualityReds$volatile.acidity
t = 9.659, df = 1597, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1880823 0.2807254
sample estimates:
      cor 
0.2349373 

The pH - volatile acidity correlation is positive but below the accepted weak threshold of 0.3.

Bivariate Analysis

Red wine: quality rating - alcohol content analysis.
  • The red wine quality rating and alcohol content exhibited a positive correlation.
  • Restricting the data to quality ratings above 4 strengthened the quality - alcohol correlation (0.522 versus 0.476).
  • The one exception is the drop in alcohol content associated with quality rating increase from 4 to 5.
Next the relationships between the quality ratings
and the acidity metrics were analyzed.
  • red wine quality - volatile acidity
  • red wine quality - pH
  • red wine quality - fixed acidity
  • red wine quality - citric acid

  • summary: quality - acidity metrics observations
    • volatile acidity had the strongest negative correlation - meaningful but weak
    • citric acid had the strongest positive correlation - although that correlation is less than weak
Next the relationships between the acidity metrics and pH were analyzed.

reminder - pH and acidity are inversely related: a lower pH indicates higher acidity

  • pH - fixed acidity - strongest correlation - strength: moderate
  • pH - citric acid - correlation strength: moderate
  • pH - volatile acidity - correlation strength: less than weak
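
The strength labels used throughout this section follow the 0.3, 0.5, and 0.7 thresholds quoted earlier. As a sketch in Python, the hypothetical helper below applies those cut-offs to the absolute value of each reported estimate; how a value exactly on a threshold is labeled is an assumption.

```python
def strength(r):
    """Label |r| using the thresholds quoted in this report (0.3, 0.5, 0.7)."""
    size = abs(r)
    if size < 0.3:
        return "less than weak"
    if size < 0.5:
        return "meaningful but weak"
    if size < 0.7:
        return "moderate"
    return "strong"

# Correlation estimates copied from the cor.test outputs in this section.
reported = {
    "quality ~ alcohol": 0.4761663,
    "quality ~ alcohol (quality > 4)": 0.5218858,
    "quality ~ volatile.acidity": -0.3905578,
    "quality ~ pH": -0.05773139,
    "quality ~ fixed.acidity": 0.1240516,
    "quality ~ citric.acid": 0.2263725,
    "pH ~ fixed.acidity": -0.6829782,
    "pH ~ citric.acid": -0.5419041,
    "pH ~ volatile.acidity": 0.2349373,
}

for pair, r in reported.items():
    print(f"{pair:32s} {r:+.3f}  {strength(r)}")
```

Working from the absolute value keeps the sign, which gives the direction, separate from the magnitude, which gives the strength.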

Multivariate Plots Section


For most quality ratings, alcohol increase is accompanied by pH increase (lower acidity).

Red wine pH, quality follow up analysis.

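Quality tends to increase as pH increases (lower acidity), although the overall quality - pH correlation reported above is slightly negative (r = -0.058).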
Quality tends to increase as alcohol percent by volume increases.


The plot does not show a clear overall pH trend.
No discernible color gradient pattern or striations were observed.


The plot does not show a clear overall Fixed Acidity trend.
No discernible color gradient pattern or striations were observed.


The plot does not show a clear overall Volatile Acidity trend.
No discernible color gradient pattern or striations were observed.


The plot does not show a clear overall Citric Acid trend.
No discernible color gradient pattern or striations were observed.

Multivariate Analysis

For most quality ratings, alcohol increase is accompanied by pH increase (lower acidity).

No clear relationship emerged when pH was mapped to color
on the quality - alcohol plot.

No clear relationship emerged when Fixed Acidity was
mapped to color on the quality - alcohol plot.

No clear relationship emerged when Volatile Acidity
was mapped to color on the quality - alcohol plot.

No clear relationship emerged when Citric Acid was
mapped to color on the quality - alcohol plot.


Final Plots and Summary


For most quality ratings, alcohol increase is accompanied by pH increase (lower acidity).
I chose this plot because it communicates the relationship between alcohol content and pH,
broken out by quality rating.
Pearson’s r mathematically augments the visualizations.


I chose this plot because it communicates the positive relationship between alcohol content and quality rating.
Per the Centers for Disease Control and Prevention -
“Alcohol use slows reaction time and impairs judgment…”
The CDC’s acknowledgement of the effect alcohol has on judgment suggests further analysis is warranted.
Does alcohol content affect the subjective quality rating?
Further analysis would include a controlled experiment where alcohol concentration was the only variable.
The results would be used to further analyze and refine the relationship between alcohol concentration
and the subjective quality rating.


I chose this plot because of the observed relationship between Volatile Acidity and pH.
The plot suggests a rise in pH (less acidic measurements) as Volatile Acidity increases.
Initially one might assume that if a measure of acidity such as Volatile Acidity increased,
there would be a corresponding decrease in pH, that is, a more acidic pH measurement.
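
The expectation described above rests on the standard definition of pH as the negative base-10 logarithm of hydrogen ion concentration. The Python sketch below (hypothetical function names) applies that definition to the minimum and maximum pH from the summary statistics. It illustrates the definition only and does not model how any single acid measurement contributes to pH.

```python
import math

def ph_from_concentration(h_molar):
    """pH is the negative base-10 logarithm of hydrogen ion concentration (mol/L)."""
    return -math.log10(h_molar)

def concentration_from_ph(ph):
    """Invert the definition: hydrogen ion concentration in mol/L."""
    return 10 ** (-ph)

# Minimum and maximum pH from the summary statistics in this report.
most_acidic, least_acidic = 2.74, 4.01
ratio = concentration_from_ph(most_acidic) / concentration_from_ph(least_acidic)
print(f"{ratio:.1f}x more hydrogen ions at pH {most_acidic} than at pH {least_acidic}")
```

Because the scale is logarithmic, the seemingly narrow pH range in this dataset spans more than an order of magnitude in hydrogen ion concentration.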


I chose this plot because it illustrates the relationship between pH, alcohol content, and quality:

Higher quality wines tend to have mid-range pH values and higher alcohol content.

Additional studies - density - alcohol - quality analysis.
I chose this plot because it illustrates the relationship between density, alcohol content, and quality:

Higher quality wines tend to have lower density and higher alcohol content.


Reflection

What were some of the struggles you went through?

What was surprising?
The unexpected relationship between Volatile Acidity and pH.
As Volatile Acidity increased, pH increased. pH is a measure of acidity.
It was unexpected to see the Volatile Acidity increase yield a more alkaline, less acidic pH result.

Future work.
Further investigate the relationship between alcohol content and subjective quality rating.
Next step - Additional quality rating analysis based on a controlled experiment where alcohol concentration
is the only variable.